Project: Investigate a Dataset (TMDB Dataset)


Introduction

The TMDB dataset contains information (popularity, user ratings, budget, revenue, release date, and more) on over 10000 movies collected from TMDB. In this notebook we walk through the data analysis process for this dataset, answering a set of questions and looking for patterns and correlations between variables.

Important Note

Because a large share of the values in two important columns, budget and revenue, are missing, we will perform two separate analyses on two separate datasets.

The first dataset drops all rows with missing values in the budget and revenue columns, which reduces the number of observations to only 3854. The analysis of this dataset focuses on the correlation of budget or revenue with other features.

The second dataset drops the budget and revenue columns entirely, which keeps the original number of observations after dropping nulls and duplicates. The analysis of this dataset focuses on correlations between other features such as popularity, genres, and runtime.

This entire process is documented in the next section.

Questions for budget and revenue analysis:

  1. Is there a correlation between budget and revenue?
  2. Have movie revenues increased over the years?

Questions for other movies features analysis:

  1. Which year has the most released movies?
  2. What are the most common genres in high rated movies?
  3. What are the most common genres in low rated movies?
  4. Is there a correlation between runtime and average user rating?

Data Wrangling

General Properties

The previous summary statistics show that some missing values are recorded as zeros.
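As a minimal sketch of how such hidden missing values can be counted, the snippet below uses a small made-up table in place of the real data. The column names budget, revenue, and runtime come from this notebook; the values and the name df_marked are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie table; the values are invented for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "budget": [150_000_000, 0, 30_000_000, 0],
    "revenue": [500_000_000, 0, 0, 12_000_000],
    "runtime": [124, 95, 0, 101],
})

cols = ["budget", "revenue", "runtime"]

# Count the zeros in each column: these are missing values in disguise.
zero_counts = (df[cols] == 0).sum()
print(zero_counts)

# Optionally mark them as NaN on a copy so isnull() and dropna() can see them.
df_marked = df.copy()
df_marked[cols] = df_marked[cols].replace(0, np.nan)
print(df_marked.isnull().sum())
```

Marking zeros as NaN on a copy keeps the original table untouched, which makes it easy to compare row counts before and after any drop.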

Dropping every row with a missing value would cause a huge loss of information about other features.

To solve this problem, we will separate the dataset into two datasets:

The first dataset drops all rows with missing values in the budget and revenue columns, which reduces the number of rows to less than half.

The second dataset drops the budget and revenue columns entirely, which keeps the original number of rows.

Now let's check for missing values in the runtime column to see whether it is appropriate to drop them.

Dropping 31 rows from the dataset will not affect the quality of the analysis.


Data Cleaning


Now, to fix the format of the genres column, we need to take a few steps:

  1. Set id as the index.
  2. Create a new dataframe by splitting the values of genres into columns.
  3. Use DataFrame.stack() to stack the genre columns into rows.
  4. Join the new genres dataframe with the original dataframe on id.

Later, in the analysis section, we will join this genres_df with the original df for the genre part of the analysis.
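The four steps above can be sketched as follows. This is an illustration on made-up rows, not the notebook's original code: it assumes the genres column stores several genres in one string separated by "|", and the names wide, stacked, merged, and the output column genre are hypothetical.

```python
import pandas as pd

# Toy rows; the "|" separator is an assumption about the raw genres format.
df = pd.DataFrame({
    "id": [10, 20, 30],
    "original_title": ["A", "B", "C"],
    "genres": ["Action|Thriller", "Drama", "Comedy|Romance|Drama"],
})

# 1. Set id as the index so every genre row keeps its movie id.
indexed = df.set_index("id")

# 2. Split the genres string into one column per genre.
wide = indexed["genres"].str.split("|", expand=True)

# 3. Stack the columns into rows; dropna() removes the empty cells left
#    by movies that have fewer genres than the widest row.
stacked = wide.stack().dropna()

# Tidy up into one (id, genre) pair per row.
genres_df = (
    stacked.reset_index(level=1, drop=True)
    .rename("genre")
    .reset_index()
)

# 4. Join back to the movie table on id (one row per movie/genre pair).
merged = genres_df.merge(df.drop(columns="genres"), on="id")
print(merged)
```

Keeping genres_df as a separate long table means the main dataframe still has one row per movie, so counts and averages outside the genre analysis are not inflated by multi-genre movies.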


Now let's split our dataset:
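A minimal sketch of the split on a toy table, assuming that zeros mark the missing values as noted earlier. The name budg_rev_df is used later in this notebook; other_df is a hypothetical name for the second dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned movie table; values are invented.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "budget": [150_000_000, 0, 30_000_000, 0, 8_000_000],
    "revenue": [500_000_000, 0, 0, 12_000_000, 20_000_000],
    "runtime": [124, 95, 88, 0, 101],
    "vote_average": [7.1, 5.4, 6.0, 6.3, 6.8],
})

# Rows with a zero runtime are treated as missing in both datasets.
df = df[df["runtime"] != 0]

# Dataset 1: keep only rows where both budget and revenue are recorded.
budg_rev_df = df[(df["budget"] != 0) & (df["revenue"] != 0)].copy()

# Dataset 2: keep every remaining row, but drop the two sparse columns.
other_df = df.drop(columns=["budget", "revenue"])

print(len(budg_rev_df), len(other_df))
```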

Exploratory Data Analysis

budget and revenue analysis:

(Analysis of 3854 observations)

Is there a correlation between budget and revenue?

Before we check the revenue distribution, let's define a box plot function to avoid repeating code:
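One possible shape for such a helper is sketched below. The function name box_plot, its parameters, and the toy data are assumptions; the original function may have looked different.

```python
import matplotlib.pyplot as plt
import pandas as pd


def box_plot(data, column, title=None, ax=None):
    """Draw a box plot of one numeric column and return the Axes."""
    if ax is None:
        _, ax = plt.subplots(figsize=(4, 6))
    # Missing values are dropped because boxplot cannot summarize NaN.
    ax.boxplot(data[column].dropna().to_numpy())
    ax.set_title(title or f"Distribution of {column}")
    ax.set_ylabel(column)
    ax.set_xticks([])  # a single box needs no x tick
    return ax


# Usage on made-up values: one large value shows up as an upper outlier.
toy = pd.DataFrame({"revenue": [1, 2, 3, 4, 100]})
ax = box_plot(toy, "revenue")
```

Returning the Axes lets the caller adjust labels or place several box plots side by side by passing in an existing ax.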

There appear to be some upper outliers.

The upper outliers are fine to keep, since there is no reasonable upper limit on revenue or budget.

However, the minimum values in the two columns are recorded as 2 and 1 dollars, which doesn't make sense. Let's query budg_rev_df again to see 5 rows of raw data with revenue values below 10000 dollars, for instance.
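A query along these lines would surface the suspicious rows. The rows below are invented, and original_title and suspicious are assumed names used only for this sketch.

```python
import pandas as pd

# Toy stand-in for budg_rev_df; titles and values are made up.
budg_rev_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "budget": [6_000_000, 25_000_000, 10, 40_000_000],
    "revenue": [2, 120_000_000, 11, 9_500],
})

# Smallest recorded value in each column.
print(budg_rev_df[["budget", "revenue"]].min())

# Up to 5 rows whose revenue is implausibly low, smallest first.
suspicious = (
    budg_rev_df.query("revenue < 10000")
    .sort_values("revenue")
    .head(5)
)
print(suspicious)
```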

These values must have been entered incorrectly. Let's confirm that by manually looking up their actual values.

For example, the IMDb page for the movie "Shattered Glass" (recorded revenue: 2 dollars) shows, in its Box office section, that the real revenue is more than 2000000 dollars!

After checking multiple rows with the same problem, it seems that these values are typos that were entered in millions of dollars rather than in dollars (for example, 2 instead of 2000000). This often happened with revenue and budget values below 1000.

The best solution would be to obtain the missing or incorrectly entered data from another source, but since that is not possible here, we can try another approach.

We will create two dataframes: low_rev for the rows whose revenue is less than 1000, and low_budg for the rows whose budget is less than 1000. We then fix these values in the new dataframes by multiplying each by 1000000, append the fixed rows back to budg_rev_df, and create a new dataframe, clean_budg_rev, that drops the typo values from budg_rev_df.
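The sketch below shows a more compact variant of this fix. It is not the append-then-drop route described above: it rescales the affected cells in place on a copy using boolean masks, which should leave the same values in clean_budg_rev. The constants THRESHOLD and SCALE are hypothetical names, and the rows are made up.

```python
import pandas as pd

# Toy stand-in for budg_rev_df; values are invented.
budg_rev_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "budget": [6_000_000, 25_000_000, 10, 40_000_000],
    "revenue": [2, 120_000_000, 11, 95_000_000],
})

THRESHOLD = 1000     # values below this are assumed to be typos
SCALE = 1_000_000    # assumed to have been entered in millions

# Work on a copy so the raw values stay available for comparison.
clean_budg_rev = budg_rev_df.copy()
for col in ["budget", "revenue"]:
    low = clean_budg_rev[col] < THRESHOLD
    clean_budg_rev.loc[low, col] = clean_budg_rev.loc[low, col] * SCALE

print(clean_budg_rev)
```

Note that the rule treats every value below the threshold as a typo, so a genuinely tiny budget would be inflated too; that is a limitation of the heuristic rather than of the code.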


Now, let's go back to our analysis.

Let's check the revenue and budget distributions with box plots again.

The distribution of revenue is right skewed.

The median movie revenue is 45.47 million USD.

50% of the revenue values are between 14 million USD and 124.9 million USD.

There are some upper outliers, but they are fine to keep since there is no reasonable upper limit on movie revenue.

The distribution of budget is right skewed.

The median movie budget is 24 million USD.

50% of the budget values are between 10 million USD and 50 million USD.

There are some upper outliers, but they are fine to keep since there is no reasonable upper limit on movie budgets.
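The figures quoted above (median, middle 50%, upper outliers, skew) can all be read from a box plot, but they can also be computed directly. The sketch below does so on a small made-up series; the 1.5 x IQR fence is the default whisker rule in matplotlib box plots, and the variable names are assumptions.

```python
import pandas as pd

# Made-up values with one very large entry to mimic a right-skewed column.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="revenue")

# Quartiles: the middle 50% of the data lies between q1 and q3.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points above q3 + 1.5 * IQR are drawn as upper outliers in a box plot.
upper_fence = q3 + 1.5 * iqr
upper_outliers = values[values > upper_fence]

# A positive skewness is consistent with a right-skewed distribution.
skewness = values.skew()

print(q1, median, q3, upper_fence)
print(upper_outliers.tolist(), skewness > 0)
```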


Now let's answer our question.

Is there a correlation between budget and revenue?

There is no correlation between movie budgets and movie revenues.
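A scatter plot gives a visual impression; one way to put a number on it is a correlation coefficient. The sketch below computes Pearson's r and, because both columns are right skewed, a rank-based (Spearman-style) coefficient obtained by correlating the ranks. The data are made up and say nothing about the real result.

```python
import pandas as pd

# Toy values, invented for illustration only.
toy = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [12, 30, 28, 90, 400],
})

# Pearson's r measures linear association and is sensitive to outliers.
pearson_r = toy["budget"].corr(toy["revenue"])

# Correlating the ranks gives a Spearman-style coefficient that is
# less affected by skew and extreme values.
rank_r = toy["budget"].rank().corr(toy["revenue"].rank())

print(round(pearson_r, 3), round(rank_r, 3))
```

Values near 0 would support a "no correlation" reading, while values near 1 or -1 would indicate a strong positive or negative relationship.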


Have movie revenues increased over the years?

Movie revenues have increased continuously over the years, with a high of 26.20 billion USD in 2015.
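A yearly figure like this can be produced with a group-by. The sketch below assumes a release_year column (not named in this notebook's text) and sums revenue per year, since the 26.20 billion figure reads like a yearly total; the rows are invented.

```python
import pandas as pd

# Toy stand-in for clean_budg_rev; release_year is an assumed column name.
clean_budg_rev = pd.DataFrame({
    "release_year": [2013, 2013, 2014, 2015, 2015, 2015],
    "revenue": [100, 50, 200, 300, 150, 50],
})

by_year = clean_budg_rev.groupby("release_year")["revenue"]

# Total revenue per year and the year with the highest total.
revenue_by_year = by_year.sum()
peak_year = revenue_by_year.idxmax()

# Mean revenue per movie, which is not inflated by more releases per year.
mean_by_year = by_year.mean()

print(revenue_by_year)
print(peak_year)
print(mean_by_year)
```

In the toy data the total peaks in 2015 while the per-movie mean peaks in 2014, which shows why it can be worth checking both before concluding that revenues have risen.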


Other movies features analysis:

(Analysis of 10812 observations)

Which year has the most released movies?

Now we will define a histogram function to avoid repetitive code:

2014 has the most released movies, with 696 movies.
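The same answer can be read off without a plot by counting rows per year. In this sketch, release_year and other_df are assumed names and the rows are made up.

```python
import pandas as pd

# Toy stand-in for the second dataset; one row per movie.
other_df = pd.DataFrame({
    "release_year": [2012, 2013, 2014, 2014, 2014, 2015, 2015],
})

# Number of movies released in each year.
counts = other_df["release_year"].value_counts()

top_year = counts.idxmax()   # year with the most releases
top_count = counts.max()     # how many movies that year has

print(top_year, top_count)
```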

What are the most common genres in high rated movies?

First, we will check the distribution of the average rating.

This is almost a normal distribution, so we will use the mean of vote_average to split the dataframe into high_rated and low_rated.
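A minimal sketch of this split on made-up ratings. Which group receives a movie rated exactly at the mean is not stated here, so placing ties in high_rated is an assumption; other_df is a hypothetical name.

```python
import pandas as pd

# Toy ratings, invented for illustration.
other_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "vote_average": [4.0, 5.5, 6.0, 7.0, 7.5],
})

# The mean rating is the cut-off between the two groups.
mean_rating = other_df["vote_average"].mean()

# Assumption: a rating equal to the mean counts as high rated.
high_rated = other_df[other_df["vote_average"] >= mean_rating]
low_rated = other_df[other_df["vote_average"] < mean_rating]

print(mean_rating, len(high_rated), len(low_rated))
```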

Now we will plot a histogram to show the most common genres in the high rated movies.

The most common genres in high rated movies:

  1. Drama
  2. Comedy
  3. Thriller
  4. Action
  5. Romance

What are the most common genres in low rated movies?

The most common genres in low rated movies:

  1. Comedy
  2. Drama
  3. Thriller
  4. Action
  5. Horror
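Rankings like the two lists above can be produced by joining the long genres table to the ratings and counting genres within each group. The sketch uses invented rows; movies, rated, top_high, and top_low are hypothetical names, and genres_df is assumed to hold one (id, genre) pair per row.

```python
import pandas as pd

# Toy ratings and a toy long-format genres table; values are made up.
movies = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "vote_average": [7.5, 7.0, 4.5, 5.0],
})
genres_df = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 4],
    "genre": ["Drama", "Romance", "Drama", "Comedy", "Horror", "Comedy"],
})

mean_rating = movies["vote_average"].mean()

# Attach each movie's rating to every one of its genre rows.
rated = genres_df.merge(movies, on="id")

high = rated[rated["vote_average"] >= mean_rating]
low = rated[rated["vote_average"] < mean_rating]

# The five most frequent genres in each group.
top_high = high["genre"].value_counts().head(5)
top_low = low["genre"].value_counts().head(5)

print(top_high)
print(top_low)
```

Since Drama, Comedy, Thriller, and Action appear in both lists, comparing each genre's share within its group (value_counts(normalize=True)) may separate the two groups more clearly than raw counts.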

Is there a correlation between runtime and average rating?

Checking 'runtime' distribution:

The distribution of runtime is right skewed.

The median movie runtime is 99 mins.

50% of runtime values are between 90 mins and 112 mins.

There are many upper outliers and some lower outliers, but they are fine to keep since their values belong to very long or very short movies, and there is no reason to drop them from the dataset.


Now let's create a scatter plot to check the correlation between movie runtime and average rating.

There is no correlation between average rating and runtime.


Conclusions

budget and revenue conclusions:

(Analysis of 3854 observations)

1. There is no correlation between movie budgets and movie revenues.

2. Movie revenues have increased continuously over the years, with a high of 26.20 billion USD in 2015.

Other movies features conclusions:

(Analysis of 10812 observations)

1. 2014 has the most released movies, with 696 movies.

2. The most common genres in high rated movies: Drama, Comedy, Thriller, Action, Romance.

3. The most common genres in low rated movies: Comedy, Drama, Thriller, Action, Horror.

4. There is no correlation between average rating and runtime.


Limitations:

1. A lot of missing values:

2. Data entry typos:

3. Structural problems:

4. A lot of outliers: